NSF PAR Search | NSF Public Access Repository

Multi-Object Hallucination in Vision Language Models

https://doi.org/10.52202/079017-1409

Chai, Joyce; Chen, Xuweiyi; Fouhey, David; Ma, Ziqiao; Qian, Shengyi; Xu, Sihan; Yang, Jianing; Zhang, Xuejun (December 2024, Neural Information Processing Systems Foundation, Inc. (NeurIPS))

Full Text Available
Towards A Holistic Landscape of Situated Theory of Mind in Large Language Models

Ma, Ziqiao; Sansom, Jacob; Peng, Run; Chai, Joyce (November 2023, Findings of Empirical Methods in Natural Language Processing)

Large Language Models (LLMs) have generated considerable interest and debate regarding their potential emergence of Theory of Mind (ToM). Several recent inquiries reveal a lack of robust ToM in these models and pose a pressing demand to develop new benchmarks, as current ones primarily focus on different aspects of ToM and are prone to shortcuts and data leakage. In this position paper, we seek to answer two road-blocking questions: (1) How can we taxonomize a holistic landscape of machine ToM? (2) What is a more effective evaluation protocol for machine ToM? Following psychological studies, we taxonomize machine ToM into 7 mental state categories and delineate existing benchmarks to identify under-explored aspects of ToM. We argue for a holistic and situated evaluation of ToM that breaks ToM into individual components and treats an LLM as an agent that is physically situated in environments and socially situated in interactions with humans. Such situated evaluation provides a more comprehensive assessment of mental states and potentially mitigates the risk of shortcuts and data leakage. We further present a pilot study in a grid world setup as a proof of concept. We hope this position paper can facilitate future research to integrate ToM with LLMs and offer an intuitive means for researchers to better position their work in the landscape of ToM.
Full Text Available
CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation

Xu, Sihan; Ma, Ziqiao; Huang, Yidong; Lee, Honglak; Chai, Joyce (November 2023, Thirty-seventh Conference on Neural Information Processing Systems)

Diffusion models (DMs) have enabled breakthroughs in image synthesis tasks but lack an intuitive interface for consistent image-to-image (I2I) translation. Various methods have been explored to address this issue, including mask-based methods, attention-based methods, and image-conditioning. However, it remains a critical challenge to enable unpaired I2I translation with pre-trained DMs while maintaining satisfactory consistency. This paper introduces CycleNet, a novel but simple method that incorporates cycle consistency into DMs to regularize image manipulation. We validate CycleNet on unpaired I2I tasks of different granularities. Besides scene-level and object-level translation, we additionally contribute a multi-domain I2I translation dataset to study the physical state changes of objects. Our empirical studies show that CycleNet is superior in translation consistency and quality, and can generate high-quality images for out-of-domain distributions with a simple change of the textual prompt. CycleNet is a practical framework that is robust even with very limited training data (around 2k) and requires minimal computational resources (1 GPU) to train.
Full Text Available
World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models

https://doi.org/10.18653/v1/2023.acl-long.31

Ma, Ziqiao; Pan, Jiayi; Chai, Joyce (July 2023, The 61st Annual Meeting of the Association for Computational Linguistics)

The ability to connect language units to their referents in the physical world, referred to as grounding, is crucial to learning and understanding grounded meanings of words. While humans demonstrate fast mapping in new word learning, it remains unclear whether modern vision-language models can truly represent language with their grounded meanings, and how grounding may further bootstrap new word learning. To this end, we introduce Grounded Open Vocabulary Acquisition (GOVA) to examine grounding and bootstrapping in open-world language learning. As an initial attempt, we propose object-oriented BERT (OctoBERT), a novel visually grounded language model pre-trained on image-text pairs with grounding as an objective. Through extensive experiments and analysis, we demonstrate that OctoBERT is a more coherent and faster grounded word learner, and that the grounding ability acquired during pre-training helps the model to learn unseen words more rapidly and robustly.
Full Text Available
DOROTHIE: Spoken Dialogue for Handling Unexpected Situations in Interactive Autonomous Driving Agents

Ma, Ziqiao; VanDerPloeg, Ben; Bara, Cristian-Paul; Huang, Yidong; Kim, Eui-In; Gervitz, Felix; Marge, Matthew; Chai, Joyce (January 2022, Findings of EMNLP)

Full Text Available
DANLI: Deliberative Agent for Following Natural Language Instructions

Zhang, Yichi; Yang, Jianing; Pan, Jiayi; Storks, Shane; Devraj, Nikhil; Ma, Ziqiao; Yu, Keunwoo Peter; Bao, Yuwei; Chai, Joyce (January 2022, EMNLP)

Full Text Available
